One-shot neural architecture search (NAS) has been widely used to discover architectures due to its efficiency. However, previous studies have shown that the one-shot performance estimations of architectures may not correlate well with their performance in stand-alone training, because of the excessive sharing of operation parameters (i.e., a large sharing extent) among architectures. Therefore, recent methods construct even more over-parameterized supernets to reduce the sharing extent. But these improved methods introduce a large number of extra parameters and thus cause an undesirable trade-off between training cost and ranking quality. To alleviate the above issues, we propose to apply Curriculum Learning On the Sharing Extent (CLOSE) to train the supernet efficiently. Specifically, we train the supernet with a large sharing extent (an easy curriculum) at the beginning and gradually decrease the sharing extent of the supernet (a harder curriculum). To support this training strategy, we design a novel supernet (CLOSENet) that decouples the parameters from the operations to realize a flexible sharing scheme and an adjustable sharing extent. Extensive experiments show that, compared with other one-shot supernets, CLOSE obtains better ranking quality under different computational budget constraints, and is able to discover superior architectures when combined with various search strategies. Code is available at https://github.com/walkerning/aw_nas.
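To make the curriculum concrete, the sketch below illustrates one way an easy-to-hard sharing-extent schedule could look: operation positions draw their weights from a pool of shared slots, and the number of slots grows over training (fewer slots = larger sharing extent). The slot-assignment rule, schedule shape, and all names are illustrative assumptions, not the actual CLOSENet design.

```python
# A minimal, illustrative sketch of a curriculum on sharing extent for one-shot NAS.
# Assumption: each operation position in the supernet draws its weights from one of
# `n_slots` shared parameter slots; fewer slots means a larger sharing extent.

def slots_at_epoch(epoch: int, total_epochs: int, min_slots: int = 1, max_slots: int = 16) -> int:
    """Easy-to-hard curriculum: begin fully shared (1 slot), end with max_slots."""
    progress = epoch / max(total_epochs - 1, 1)
    return round(min_slots + progress * (max_slots - min_slots))

def assign_slot(position_id: int, n_slots: int) -> int:
    """Map an operation position to a shared parameter slot (illustrative rule)."""
    return position_id % n_slots

if __name__ == "__main__":
    total_epochs, n_positions = 100, 8
    for epoch in (0, 25, 50, 99):
        n_slots = slots_at_epoch(epoch, total_epochs)
        mapping = [assign_slot(p, n_slots) for p in range(n_positions)]
        print(f"epoch {epoch:3d}: {n_slots:2d} slots, position->slot {mapping}")
```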
Recent studies on deep convolutional neural networks present a simple paradigm of architecture design, i.e., models with more MACs typically achieve better accuracy, such as EfficientNet and RegNet. These works try to enlarge all stages in a model with one unified rule through sampling and statistical methods. However, we observe that some network architectures have similar MACs and accuracy, yet their allocation of computation across stages is quite different. In this paper, we propose to enlarge the capacity of CNN models by improving their width, depth, and resolution at the stage level. Under the assumption that a top-performing smaller CNN is a proper sub-component of a top-performing larger CNN, we propose a greedy network enlarging method based on the reallocation of computation. By modifying the computation of different stages step by step, the enlarged network is equipped with an optimal allocation and utilization of MACs. On EfficientNet, our method consistently outperforms the original scaling method. In particular, by applying our method to GhostNet, we achieve state-of-the-art ImageNet top-1 accuracies of 80.9% and 84.3% under 600M and 4.4B MACs, respectively.
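The greedy reallocation described above can be summarized as a simple loop: at each step, try enlarging each stage along one dimension and keep the single change that most improves an accuracy proxy while staying under the MAC budget. The sketch below assumes hypothetical `evaluate` and `macs` callables and a dict-based stage configuration; it is not the paper's implementation.

```python
# Illustrative sketch of greedy, stage-wise network enlarging under a MAC budget.
# `evaluate(config)` is assumed to return a cheap accuracy proxy, `macs(config)` the
# multiply-accumulate count. Real depths/widths would need quantizing to integers.
from copy import deepcopy

def greedy_enlarge(config, evaluate, macs, target_macs, step=0.1):
    """Repeatedly apply the single stage-level change that helps accuracy the most,
    until the MAC budget is reached or no admissible move remains."""
    while macs(config) < target_macs:
        best_cfg, best_score = None, float("-inf")
        for stage in range(len(config["stages"])):
            for knob in ("width", "depth", "resolution"):
                cand = deepcopy(config)
                cand["stages"][stage][knob] *= (1.0 + step)
                if macs(cand) > target_macs:
                    continue
                score = evaluate(cand)
                if score > best_score:
                    best_cfg, best_score = cand, score
        if best_cfg is None:      # no stage-level move fits under the budget
            break
        config = best_cfg
    return config
```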
We study the problem of learning from positive and unlabeled (PU) data in the federated setting, where each client labels only a small part of its dataset due to limited resources and time. Unlike the setting of traditional PU learning, in which the negative class consists of a single class, the negative samples that cannot be identified by a client in the federated setting may come from multiple classes that are unknown to that client. Therefore, existing PU learning methods can hardly be applied in this situation. To address this problem, we propose a novel framework, Federated learning with Positive and Unlabeled data (FedPU), which minimizes the expected risk of multiple negative classes by leveraging the labeled data of other clients. We theoretically analyze the generalization bound of the proposed FedPU. Empirical experiments show that FedPU achieves better performance than conventional supervised and semi-supervised federated learning methods.
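For context, the sketch below shows the standard non-negative PU risk (in the style of nnPU) that a single client could minimize over its labeled-positive and unlabeled samples. FedPU's actual objective handles multiple unknown negative classes and aggregates risk across clients, which is not reproduced here.

```python
# A minimal sketch of the classical non-negative PU risk, shown only to illustrate
# the kind of objective PU learning builds on; not the paper's federated formulation.
import torch

def nn_pu_risk(scores_pos, scores_unl, prior_pos, loss=torch.nn.functional.softplus):
    """scores_*: raw model outputs for labeled-positive and unlabeled samples.
    prior_pos: the class prior pi_p = P(y = +1), assumed known or estimated."""
    risk_pos = loss(-scores_pos).mean()         # positives classified as positive
    risk_pos_as_neg = loss(scores_pos).mean()   # positives classified as negative
    risk_unl_as_neg = loss(scores_unl).mean()   # unlabeled classified as negative
    neg_part = risk_unl_as_neg - prior_pos * risk_pos_as_neg
    return prior_pos * risk_pos + torch.clamp(neg_part, min=0.0)  # non-negative correction
```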
As the computing power of modern hardware increases dramatically, pre-trained deep learning models (e.g., BERT, GPT-3) learned on large-scale datasets have shown their effectiveness over conventional methods. This great progress is mainly attributed to the representation ability of the Transformer and its variant architectures. In this paper, we study low-level computer vision tasks (e.g., denoising, super-resolution, and deraining) and develop a new pre-trained model, namely the image processing transformer (IPT). To maximally excavate the capability of the Transformer, we propose to utilize the well-known ImageNet benchmark to generate a large number of corrupted image pairs. The IPT model is trained on these images with multiple heads and multiple tails. In addition, contrastive learning is introduced to adapt well to different image processing tasks. The pre-trained model can therefore be efficiently employed on the desired task after fine-tuning. With only one pre-trained model, IPT outperforms the current state-of-the-art methods on various low-level benchmarks. Code is available at https://github.com/huawei-noah/Pretrained-IPT and https://gitee.com/mindspore/mindspore/tree/master/model_zoo/research/cv/ipt
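As a minimal illustration of how corrupted/clean pairs can be synthesized from clean images for such multi-task pre-training, the sketch below generates super-resolution and denoising pairs; the degradation types and settings are examples, not the paper's exact protocol.

```python
# Illustrative sketch of building corrupted/clean training pairs from clean images,
# in the spirit of multi-task image-processing pre-training.
import torch
import torch.nn.functional as F

def make_pairs(clean: torch.Tensor, task: str):
    """clean: a batch of images, shape (B, 3, H, W), values in [0, 1]."""
    if task == "sr_x2":           # low-resolution input, clean high-resolution target
        lr = F.interpolate(clean, scale_factor=0.5, mode="bicubic", align_corners=False)
        return lr, clean
    if task == "denoise_30":      # additive Gaussian noise with sigma = 30/255
        noisy = (clean + torch.randn_like(clean) * (30.0 / 255.0)).clamp(0.0, 1.0)
        return noisy, clean
    raise ValueError(f"unknown task: {task}")
```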
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
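The sketch below shows one plausible form of a style-aware adaptive feed-forward layer, where the style code produces per-channel scales for the hidden units; the adaptation rule and all module names here are assumptions, not necessarily the paper's exact mechanism.

```python
# A minimal sketch of a feed-forward layer whose weights are adapted by a style code.
import torch
import torch.nn as nn

class StyleAdaptiveFeedForward(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, d_style: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)
        self.to_scale = nn.Linear(d_style, d_hidden)  # style code -> per-channel scale

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (B, T, d_model), style: (B, d_style)
        scale = 1.0 + self.to_scale(style).unsqueeze(1)   # (B, 1, d_hidden)
        h = torch.relu(self.fc1(x)) * scale               # style-modulated hidden units
        return self.fc2(h)
```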
Masked image modeling (MIM) has shown great promise for self-supervised learning (SSL) yet has been criticized for learning inefficiency. We believe the insufficient utilization of training signals is responsible. To alleviate this issue, we introduce a conceptually simple yet learning-efficient MIM training scheme, termed Disjoint Masking with Joint Distillation (DMJD). For disjoint masking (DM), we sequentially sample multiple masked views per image in a mini-batch under a disjoint regulation to raise the usage of tokens for reconstruction in each image while keeping the masking rate of each view. For joint distillation (JD), we adopt a dual-branch architecture to respectively predict invisible (masked) and visible (unmasked) tokens with superior learning targets. Rooted in orthogonal perspectives on training efficiency, DM and JD cooperatively accelerate training convergence without sacrificing the model's generalization ability. Concretely, DM can train ViT with half of the effective training epochs (3.7 times less time-consuming) while reporting competitive performance. With JD, our DMJD clearly improves linear probing classification accuracy over ConvMAE by 5.8%. On fine-grained downstream tasks such as semantic segmentation and object detection, our DMJD also presents superior generalization compared with state-of-the-art SSL methods. The code and model will be made public at https://github.com/mx-mark/DMJD.
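A minimal sketch of the disjoint masking step is given below: masked views are drawn sequentially and prefer tokens that have not yet served as reconstruction targets, so their union covers as much of the image as possible while each view keeps its masking rate. The refill rule and the example configuration are assumptions, not the paper's exact regulation.

```python
# Illustrative sketch of disjoint masking (DM): several masked views per image whose
# masked token sets overlap as little as possible across views.
import torch

def disjoint_masked_views(num_tokens: int, mask_ratio: float, num_views: int) -> torch.Tensor:
    """Return (num_views, num_tokens) boolean masks (True = masked / reconstruction target)."""
    num_masked = int(num_tokens * mask_ratio)
    pool = torch.randperm(num_tokens).tolist()        # tokens not yet used as targets
    masks = torch.zeros(num_views, num_tokens, dtype=torch.bool)
    for v in range(num_views):
        if len(pool) < num_masked:                    # all tokens covered once: refill
            pool += [t for t in torch.randperm(num_tokens).tolist() if t not in pool]
        chosen, pool = pool[:num_masked], pool[num_masked:]
        masks[v, torch.tensor(chosen)] = True
    return masks

# e.g. a 14x14 ViT token grid with a 75% masking rate and two views per image:
views = disjoint_masked_views(num_tokens=196, mask_ratio=0.75, num_views=2)
```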
Recently, great progress has been made in single-image super-resolution (SISR) based on deep learning technology. However, the existing methods usually require a large computational cost. Meanwhile, the activation function causes some features of the intermediate layers to be lost. Therefore, it is a challenge to make the model lightweight while reducing the impact of intermediate feature loss on the reconstruction quality. In this paper, we propose a Feature Interaction Weighted Hybrid Network (FIWHN) to alleviate the above problem. Specifically, FIWHN consists of a series of novel Wide-residual Distillation Interaction Blocks (WDIB) as the backbone, where every three WDIBs form a Feature Shuffle Weighted Group (FSWG) through mutual information mixing and fusion. In addition, to mitigate the adverse effects of intermediate feature loss on the reconstruction results, we introduce well-designed Wide Convolutional Residual Weighting (WCRW) and Wide Identical Residual Weighting (WIRW) units in WDIB, and effectively cross-fuse features of different levels of fineness through a Wide-residual Distillation Connection (WRDC) framework and a Self-Calibrating Fusion (SCF) unit. Finally, to complement the global features lacking in CNN models, we introduce the Transformer into our model and explore a new way of combining the CNN and the Transformer. Extensive quantitative and qualitative experiments on low-level and high-level tasks show that our proposed FIWHN achieves a good balance between performance and efficiency, and is more conducive to downstream tasks that solve problems in low-pixel scenarios.
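As a loose illustration of the wide-activation-plus-residual-weighting motivation, the sketch below widens channels before the activation and scales the residual with a learnable per-channel weight; the structure and the name are assumptions, and the actual WIRW/WCRW units and their interactions inside WDIB may differ considerably.

```python
# A rough sketch of a wide-activation residual unit with a learnable residual weight,
# reflecting the motivation of reducing feature loss at the activation.
import torch
import torch.nn as nn

class WideResidualWeighting(nn.Module):
    def __init__(self, channels: int, expansion: int = 4):
        super().__init__()
        self.expand = nn.Conv2d(channels, channels * expansion, kernel_size=3, padding=1)
        self.reduce = nn.Conv2d(channels * expansion, channels, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)
        self.res_weight = nn.Parameter(torch.ones(1, channels, 1, 1))  # learnable residual weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.res_weight * self.reduce(self.act(self.expand(x)))
```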
Rigorous guarantees about the performance of predictive algorithms are necessary in order to ensure their responsible use. Previous work has largely focused on bounding the expected loss of a predictor, but this is not sufficient in many risk-sensitive applications where the distribution of errors is important. In this work, we propose a flexible framework to produce a family of bounds on quantiles of the loss distribution incurred by a predictor. Our method takes advantage of the order statistics of the observed loss values rather than relying on the sample mean alone. We show that a quantile is an informative way of quantifying predictive performance, and that our framework applies to a variety of quantile-based metrics, each targeting important subsets of the data distribution. We analyze the theoretical properties of our proposed method and demonstrate its ability to rigorously control loss quantiles on several real-world datasets.
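A minimal instance of such an order-statistic bound is the classical distribution-free construction below: with i.i.d. calibration losses, the k-th smallest loss upper-bounds the q-quantile with probability given by a binomial CDF. This conveys the flavor of the framework rather than the paper's exact construction.

```python
# A distribution-free upper confidence bound on a loss quantile from order statistics.
# Assumes i.i.d. losses from a held-out calibration set.
import numpy as np
from scipy.stats import binom

def quantile_upper_bound(losses: np.ndarray, q: float, delta: float) -> float:
    """Return B such that P(q-quantile of the loss distribution <= B) >= 1 - delta."""
    n = len(losses)
    sorted_losses = np.sort(losses)
    # smallest k with P(Binomial(n, q) < k) >= 1 - delta, i.e. cdf(k - 1) >= 1 - delta
    for k in range(1, n + 1):
        if binom.cdf(k - 1, n, q) >= 1 - delta:
            return float(sorted_losses[k - 1])
    return float("inf")   # not enough samples to certify this quantile at level delta

# e.g. bound the 90th-percentile loss with 95% confidence from 1000 validation losses:
# bound = quantile_upper_bound(val_losses, q=0.9, delta=0.05)
```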
Recently, large-scale pre-trained models have shown their advantages in many tasks. However, due to the huge computational complexity and storage requirements, it is challenging to apply the large-scale model to real scenes. A common solution is knowledge distillation, which regards the large-scale model as a teacher model and helps to train a small student model to obtain competitive performance. Cross-task knowledge distillation expands the application scenarios of the large-scale pre-trained model. Existing knowledge distillation works focus on directly mimicking the final prediction or the intermediate layers of the teacher model, which represent global-level characteristics and are task-specific. To alleviate the constraint of different label spaces, capturing invariant intrinsic local object characteristics (such as the shape characteristics of the legs and tails of cattle and horses) plays a key role. Considering the complexity and variability of real scene tasks, we propose a Prototype-guided Cross-task Knowledge Distillation (ProC-KD) approach to transfer the intrinsic local-level object knowledge of a large-scale teacher network to various task scenarios. First, to better transfer the generalized knowledge of the teacher model in cross-task scenarios, we propose a prototype learning module that learns from the essential feature representations of objects in the teacher model. Second, for diverse downstream tasks, we propose a task-adaptive feature augmentation module to enhance the features of the student model with the learned generalization prototype features and to guide the training of the student model, improving its generalization ability. Experimental results on various visual tasks demonstrate the effectiveness of our approach for large-scale model cross-task knowledge distillation scenarios.
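The sketch below gives one rough interpretation of the two modules: prototypes maintained as EMA summaries of teacher local features, and a simple additive augmentation of student features with their nearest prototype. Both the update rule and the augmentation rule are assumptions for illustration only, not the paper's modules.

```python
# A rough sketch of (1) prototypes distilled from teacher features and
# (2) augmenting student features with them.
import torch
import torch.nn.functional as F

class PrototypeBank:
    def __init__(self, num_prototypes: int, dim: int, momentum: float = 0.99):
        self.protos = F.normalize(torch.randn(num_prototypes, dim), dim=1)
        self.momentum = momentum

    @torch.no_grad()
    def update(self, teacher_feats: torch.Tensor) -> None:
        """Assign each teacher local feature to its nearest prototype and move that
        prototype toward the mean of its assigned features (EMA)."""
        feats = F.normalize(teacher_feats, dim=1)              # (N, dim)
        assign = (feats @ self.protos.t()).argmax(dim=1)       # nearest prototype index
        for k in assign.unique():
            mean_k = feats[assign == k].mean(dim=0)
            self.protos[k] = F.normalize(
                self.momentum * self.protos[k] + (1 - self.momentum) * mean_k, dim=0)

    def augment(self, student_feats: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
        """Mix each student feature with its nearest learned prototype."""
        feats = F.normalize(student_feats, dim=1)
        nearest = self.protos[(feats @ self.protos.t()).argmax(dim=1)]
        return student_feats + alpha * nearest
```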
Crowd counting plays an important role in risk perception and early warning, traffic control, and scene statistical analysis. The challenges of crowd counting in highly dense and complex scenes lie in the mutual occlusion of human body parts, the large variation of body scales, and the complexity of imaging conditions. Deep learning based head detection is a promising method for crowd counting. However, popular object detection networks cannot be well applied to this field for two main reasons. First, most of the existing head detection datasets are annotated only with center points instead of the bounding boxes that are mandatory for canonical detectors. Second, sample imbalance has not yet been overcome in highly dense and complex scenes, because the existing loss functions calculate the positive loss at a single key point or over the entire target area with the same weight. To address these problems, we propose a novel loss function, called Mask Focal Loss, to unify the loss functions based on heatmap ground truth (GT) and binary feature map GT. Mask Focal Loss redefines the weight of the loss contributions according to the in-situ value of the heatmap generated with a Gaussian kernel. For better evaluation and comparison, a new synthetic dataset, GTA_Head, is made public, including 35 sequences, 5,096 images, and 1,732,043 head labels with bounding boxes. Experimental results show overwhelming performance and demonstrate that our proposed Mask Focal Loss is applicable to all canonical detectors and to various datasets with different GT. This provides a strong basis for surpassing crowd counting methods based on density estimation.
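For reference, the sketch below implements a Gaussian-heatmap focal loss in the CornerNet/CenterNet style, where the negative-sample weight decays with the heatmap value near annotated centers; it conveys the Gaussian-kernel weighting idea but is not claimed to be the exact Mask Focal Loss.

```python
# A sketch of a heatmap focal loss whose negative weight is reduced near object
# centers according to the Gaussian heatmap value (CornerNet/CenterNet formulation).
import torch

def gaussian_focal_loss(pred: torch.Tensor, gt: torch.Tensor,
                        alpha: float = 2.0, beta: float = 4.0) -> torch.Tensor:
    """pred: predicted heatmap probabilities in (0, 1); gt: Gaussian heatmap GT in [0, 1],
    where gt == 1 marks annotated head centers."""
    eps = 1e-6
    pos = gt.eq(1.0)
    pos_loss = -((1 - pred) ** alpha) * torch.log(pred + eps)
    neg_loss = -((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred + eps)
    num_pos = pos.sum().clamp(min=1)
    return (pos_loss[pos].sum() + neg_loss[~pos].sum()) / num_pos
```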